FMT - Project

Domain:

Semiconductor manufacturing process

Context:

A complex modern semiconductor manufacturing process is normally under constant surveillance via the monitoring of signals/variables collected from sensors and/or process measurement points. However, not all of these signals are equally valuable in a specific monitoring system. The measured signals contain a combination of useful information, irrelevant information, and noise. Engineers typically have a much larger number of signals than are actually required. If we consider each type of signal as a feature, then feature selection may be applied to identify the most relevant signals. Process engineers may then use these signals to determine the key factors contributing to yield excursions downstream in the process. This will enable increased process throughput, decreased time to learning, and reduced per-unit production costs. These signals can be used as features to predict the yield type, and by analysing and trying out different combinations of features, the essential signals impacting the yield type can be identified.

Data Description:

The data consists of 1567 examples, each with 591 features. The dataset presented in this case represents a selection of such features, where each example represents a single production entity with its associated measured features, and the label represents a simple pass/fail yield for in-house line testing. A target value of –1 corresponds to a pass and 1 corresponds to a fail, and the timestamp is for that specific test point.

Project Objective:

We will build a classifier to predict the Pass/Fail yield of a particular process entity and analyse whether all the features are required to build the model or not.

EDA and Data Pre-processing

The majority of the attributes contain outliers; these will be replaced with the column median
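As a rough sketch of this step, outliers in each column could be detected with the 1.5×IQR rule and replaced by that column's median. The exact outlier definition used in the project isn't stated, so the rule and the toy data below are assumptions:

```python
import numpy as np
import pandas as pd

def replace_outliers_with_median(df: pd.DataFrame) -> pd.DataFrame:
    """Replace values outside the 1.5*IQR whiskers with the column median."""
    out = df.copy()
    for col in out.columns:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        median = out[col].median()
        mask = (out[col] < lo) | (out[col] > hi)
        out.loc[mask, col] = median
    return out

# Toy sensor column with one obvious outlier (50.0)
df = pd.DataFrame({"signal": [1.0, 1.1, 0.9, 1.2, 50.0]})
cleaned = replace_outliers_with_median(df)
```

The outlying 50.0 is replaced by the column median (1.1), leaving the in-range values untouched.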

Some variables still hold 0 as a constant signal; these will be dropped after scaling with the z-score
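A minimal sketch of this step is below. Note that the constant columns are dropped before scaling here, since a zero-variance column has an undefined z-score; the toy data is illustrative:

```python
import pandas as pd

def scale_and_drop_constant(df: pd.DataFrame) -> pd.DataFrame:
    """Drop zero-variance (constant) columns, then z-score scale the rest."""
    constant_cols = df.columns[df.std(ddof=0) == 0]
    kept = df.drop(columns=constant_cols)
    return (kept - kept.mean()) / kept.std(ddof=0)

# Column "a" is a constant 0 signal and should be dropped
df = pd.DataFrame({"a": [0.0, 0.0, 0.0], "b": [1.0, 2.0, 3.0]})
out = scale_and_drop_constant(df)
```

After this step, only the informative column remains, with mean 0 and unit variance.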

The majority of the variables appear to be approximately normally distributed

Trying various algorithms along with different sampling techniques

Splitting the past data into train and test sets in a 70:30 ratio

No sampling

Random undersampling

SMOTE

Random Oversampling

ADASYN sampling
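The split-and-resample steps above can be sketched as follows on synthetic data. Random oversampling is done by hand here to keep the example self-contained; SMOTE and ADASYN would typically come from the imbalanced-learn package. All shapes and class counts below are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = np.array([-1] * 186 + [1] * 14)  # heavily imbalanced pass/fail labels

# 70:30 split, stratified so the rare fail class appears in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Manual random oversampling: duplicate minority (fail) rows until balanced
minority = np.where(y_train == 1)[0]
majority = np.where(y_train == -1)[0]
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
idx = np.concatenate([majority, minority, extra])
X_bal, y_bal = X_train[idx], y_train[idx]
```

Random undersampling would instead subsample the majority class down to the minority count, while SMOTE/ADASYN synthesise new minority examples rather than duplicating existing ones.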

Gaussian Naive Bayes on the normal dataset

Gaussian Naive Bayes on the undersampled dataset

LightGBM on the SMOTE-sampled dataset

Random Forest on the randomly oversampled dataset

LightGBM on the ADASYN-sampled dataset
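One fit-and-score step from the list above (Gaussian Naive Bayes on the unsampled data) can be sketched like this; synthetic data stands in for the real features, and the metric names follow the observations below:

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X_train = rng.normal(size=(140, 5))
y_train = np.array([-1] * 126 + [1] * 14)  # imbalanced pass/fail labels
X_test = rng.normal(size=(60, 5))
y_test = np.array([-1] * 54 + [1] * 6)

clf = GaussianNB().fit(X_train, y_train)
pred = clf.predict(X_test)

# Sensitivity = recall on the fail (+1) class; Type 2 error rate = missed fails
tn, fp, fn, tp = confusion_matrix(y_test, pred, labels=[-1, 1]).ravel()
sensitivity = tp / (tp + fn)
type2_error_rate = fn / (tp + fn)
```

The same loop applies to the other model/sampler combinations, swapping in the resampled training set and the corresponding estimator.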

Evaluating the results on the validation/future dataset

Observations:

1) Gaussian Naive Bayes without any sampling gives a sensitivity of 48% with a Type 2 error rate of 52% while predicting 9 observations as failed; adding a threshold of 0.016 raises the sensitivity to 71% and reduces the Type 2 error rate by 23%, at the cost of a 17% increase in the Type 1 error rate, while predicting 14 observations as failed.
2) Random Forest with randomly oversampled data gives 100% specificity and Type 2 error rate while predicting 1 observation as failed; adding a threshold of 0.1688 gives a sensitivity of 71% and reduces the Type 2 error rate by 71%, at the cost of a 27% increase in the Type 1 error rate, while predicting 7 observations as failed.
3) Furthermore, we could have tried modelling the data using only the important variables from the variable-importance plot, to check whether the performance of the model increases.
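The threshold adjustment in the observations above amounts to replacing the default 0.5 probability cut-off with a tuned value. A hedged sketch (synthetic data; the 0.016 threshold is the one quoted in observation 1):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = np.array([-1] * 90 + [1] * 10)  # imbalanced pass/fail labels

clf = GaussianNB().fit(X, y)
fail_col = list(clf.classes_).index(1)
fail_proba = clf.predict_proba(X)[:, fail_col]

# Lowering the cut-off flags more observations as fails: the Type 2 error
# rate drops at the cost of a higher Type 1 error rate.
threshold = 0.016
pred = np.where(fail_proba >= threshold, 1, -1)
```

In practice the threshold would be tuned on a validation set against an acceptable Type 1/Type 2 trade-off rather than fixed up front.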

PCA to check whether the dimensions can be reduced further, then rebuilding the models as above
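A minimal PCA sketch for this step, keeping enough components to explain 95% of the variance; the component count/criterion actually used in the project isn't stated, so the 0.95 target and toy data are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # stand-in for the scaled sensor features

# Standardise first so no single signal dominates the components
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the fewest components explaining that
# fraction of the total variance
pca = PCA(n_components=0.95).fit(X_scaled)
X_reduced = pca.transform(X_scaled)
```

The reduced matrix then feeds the same train/test split and sampling pipeline as before.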

Trying various algorithms along with different sampling techniques

Splitting the past data into train and test sets in a 70:30 ratio

Random Undersampling

SMOTE

Random Oversampling

ADASYN sampling

Logistic Regression with the normal dataset

SVM with the undersampled dataset

LightGBM with the SMOTE-sampled dataset

Random Forest with the oversampled dataset

XGBoost with the ADASYN-sampled dataset

Evaluating the results on the validation/future dataset

Observations:

1) SVM with random undersampling gives a sensitivity of 29% with a Type 2 error rate of 71% while predicting 5 observations as failed; adding a threshold of 0.3555 gives a sensitivity of 65% and reduces the Type 2 error rate by 50%, at the cost of a 27% increase in the Type 1 error rate, while predicting 11 observations as failed.

On unseen data, Random Forest without PCA, using random oversampling, would give an accuracy of between 96.9% and 99.6%, 95% of the time.

On unseen data, SVM with PCA, using random undersampling, would give an accuracy of between 48% and 91%, 95% of the time.
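Intervals like the two above can be produced, for example, with a normal-approximation 95% confidence interval on test accuracy; the project's exact method isn't shown, and the counts below are hypothetical:

```python
import math

def accuracy_ci_95(n_correct: int, n_total: int) -> tuple[float, float]:
    """Normal-approximation (Wald) 95% CI for classification accuracy."""
    p = n_correct / n_total
    half = 1.96 * math.sqrt(p * (1 - p) / n_total)
    return max(0.0, p - half), min(1.0, p + half)

lo, hi = accuracy_ci_95(462, 470)  # hypothetical validation counts
```

Wider intervals, like the SVM-with-PCA one, typically reflect either fewer evaluation examples or less stable accuracy across resamples.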

Conclusion:

Based on the overall analysis and the performance of the models, we can narrow down the observations that are common to all the validation sets and infer that these are likely to fail. This definitely needs to be checked with a domain specialist to obtain an acceptable threshold limit for the Type 2 error rate; the best-performing model can then be chosen on that basis. As far as PCA is concerned, the models do better without it, and Random Forest with random oversampling gives the best overall results.